Nonblocking Checkpointing for Optimistic Parallel Simulation: Description and an Implementation

Authors

  • Francesco Quaglia
  • Andrea Santoro
Abstract

This paper describes a non-blocking checkpointing mode in support of optimistic parallel discrete event simulation. This mode allows real concurrency between state saving and other simulation-specific operations (e.g., event list update, event execution), with the aim of removing the cost of recording state information from the completion time of the parallel simulation application. We present an implementation of a C library supporting non-blocking checkpointing on a Myrinet-based cluster, which demonstrates the practical viability of this checkpointing mode on standard off-the-shelf hardware. Through an empirical study on classical parameterized synthetic benchmarks, we show that, except for applications with minimal state granularity, non-blocking checkpointing speeds up parallel execution compared with commonly adopted, optimized checkpointing methods based on the classical blocking mode. A performance study of a Personal Communication System (PCS) simulation is also reported to point out the benefits of non-blocking checkpointing for a real-world application.

Index Terms: Parallel Discrete-Event Simulation, Optimistic Synchronization, Checkpointing, Myrinet, DMA, Performance Optimization.

Similar resources

On the Trade-off between Time and Space in Optimistic Parallel Discrete-Event Simulation

Optimistically synchronized parallel discrete-event simulation is based on the use of communicating sequential processes. Optimistic synchronization means that the processes execute under the assumption that synchronization is fortuitous. Periodic checkpointing of the state of a process allows the process to roll back to an earlier state when synchronization errors occur. This paper examines th...

On Coordinated Checkpointing in Distributed Systems

Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, the approach suffers from high overhead associated with the checkpointing process. Two approaches are used to reduce the overhead: First is to minimize the number of synchronization messages and the number of checkpoints...

An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment

Mobile computing systems are made up of different components among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environment. In the scheme suggested nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...

The STAR Fault Manager for Distributed Operating Environments. Design, Implementation and Performance

This paper presents the design, implementation, and performance evaluation of a software fault manager for distributed applications. Dubbed STAR, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. To improve the response time of fault-tolerant applications, ST A ...

Multiprogrammed non-blocking checkpoints in support of optimistic simulation on myrinet clusters

CCL (Checkpointing and Communication Library) is a software layer in support of optimistic Parallel Discrete Event Simulation (PDES) on Myrinet-based COTS clusters. Beyond classical low-latency message delivery functionalities, this library implements CPU-offloaded, non-blocking (asynchronous) checkpointing functionalities based on data transfer capabilities provided by a programmable DMA engin...

Journal:
  • IEEE Trans. Parallel Distrib. Syst.

Volume 14  Issue 

Pages  -

Publication date 2003